This project analyses Boulder B-cycle data to understand and document usage patterns from 2013 to early 2016. The analysis relies on summaries and visualizations. The project also applies machine learning to the numerical data to try to predict the pass type.
The analysis is divided into 4 sections:
Summary of Data
Data Visualizations
Machine Learning to predict the Pass Type
Conclusion
##   Rider.Home.System Rider.or.Operator.Number Entry.Pass.Type Bike.Number
## 1   Boulder B-cycle                 R1011535         24-hour         548
## 2   Boulder B-cycle                 R1011722         24-hour         742
## 3   Boulder B-cycle                 R1008367          Annual         578
## 4   Boulder B-cycle                 R1010650         24-hour         616
## 5   Boulder B-cycle                 R1008367          Annual         578
## 6   Boulder B-cycle                 R1055681          Annual         601
##   Checkout.Date Checkout.Day.of.Week Checkout.Time  Checkout.Station
## 1     5/20/2011               Friday    9:24:00 AM      15th & Pearl
## 2     5/20/2011               Friday    9:24:00 AM      15th & Pearl
## 3     5/20/2011               Friday    9:33:00 AM Broadway & Alpine
## 4     5/20/2011               Friday    9:34:00 AM Broadway & Alpine
## 5     5/20/2011               Friday    9:36:00 AM Broadway & Alpine
## 6     5/20/2011               Friday    9:39:00 AM UCAR Center Green
##   Return.Date Return.Day.of.Week Return.Time    Return.Station
## 1   5/20/2011             Friday  9:40:00 AM      26th @ Pearl
## 2   5/20/2011             Friday  9:54:00 AM      15th & Pearl
## 3   5/20/2011             Friday  9:36:00 AM Broadway & Alpine
## 4   5/20/2011             Friday  9:37:00 AM Broadway & Alpine
## 5   5/20/2011             Friday  9:39:00 AM Broadway & Alpine
## 6   5/20/2011             Friday  9:42:00 AM UCAR Center Green
##   Trip.Duration..Minutes.
## 1                      16
## 2                      30
## 3                       3
## 4                       3
## 5                       3
## 6                       3
##               Rider.Home.System Rider.or.Operator.Number
##  Boulder B-cycle        :243333   M9999957:  9499
##  Denver B-cycle         :  4666   M9999950:  5684
##  Madison B-cycle        :   201   M9999952:  5538
##  Houston B-cycle        :   113   R1028713:  4006
##  Indy - Pacers Bikeshare:    74   M9999943:  3077
##  GREENbike              :    38   M9999998:  2835
##  (Other)                :   119   (Other) :217905
##            Entry.Pass.Type    Bike.Number       Checkout.Date
##  24-hour           : 83642   411    :  1821   6/25/2015:   703
##  7-day             :  5585   584    :  1755   8/2/2015 :   650
##  Annual            :113041   666    :  1613   8/8/2015 :   639
##  Maintenance       : 37337   744    :  1608   7/28/2015:   635
##  Semester (150-day):  8939   665    :  1607   6/26/2015:   621
##                              699    :  1596   8/5/2015 :   621
##                              (Other):238544   (Other)  :244675
##  Checkout.Day.of.Week     Checkout.Time    Checkout.Station
##  Friday   :39020      12:16:00 PM:   467   Length:248544
##  Monday   :35182      12:26:00 PM:   455   Class :character
##  Saturday :36603      12:45:00 PM:   447   Mode  :character
##  Sunday   :28767      4:12:00 PM :   434
##  Thursday :38079      5:05:00 PM :   433
##  Tuesday  :34903      12:12:00 PM:   432
##  Wednesday:35990      (Other)    :245876
##     Return.Date     Return.Day.of.Week      Return.Time
##  6/25/2015:   706   Friday   :39026    12:04:00 AM:   495
##  8/2/2015 :   651   Thursday :37881    1:13:00 PM :   451
##  8/8/2015 :   637   Saturday :36322    12:12:00 PM:   441
##  7/28/2015:   629   Wednesday:36042    1:51:00 PM :   439
##  6/26/2015:   624   Monday   :35362    12:15:00 PM:   437
##  7/11/2015:   624   Tuesday  :34944    12:52:00 PM:   436
##  (Other)  :244673   (Other)  :28967    (Other)    :245845
##  Return.Station     Trip.Duration..Minutes.
##  Length:248544      Min.   :    -2.00
##  Class :character   1st Qu.:     5.00
##  Mode  :character   Median :    12.00
##                     Mean   :    63.36
##                     3rd Qu.:    26.00
##                     Max.   :181607.00
##
There are 248544 observations of 13 variables. The variables include Checkout/Return Station, Checkout/Return Date and Time, Type of Pass, Day of the Week, Trip Duration, Bike Number and Rider/Operator Number. Also included is a location dataset with latitude and longitude, along with other information about the Checkout/Return stations.
Some values in the “Rider.Home.System” column look wrong: the data is supposed to be for Boulder, but some rows list other systems such as Denver and Houston. This is not a big issue, because the column is nearly constant and adds little value to the analysis.
NOTE: corrections were made to the Checkout/Return Station “RTD”, which is really “14th & Canyon” but was entered incorrectly as “RTD”. The error was found later in the analysis, but the correction is applied at the start so that every section uses the corrected name.
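The station-name correction described in the note can be sketched as below. This is a minimal illustration on a toy data frame, not the project's actual cleaning code: the data frame name `trips` and its rows are made up, and only the column names and the two station labels come from the text above.

```r
# Toy stand-in for the trip data (hypothetical rows; column names as in the summary above).
trips <- data.frame(
  Checkout.Station = c("RTD", "15th & Pearl", "RTD"),
  Return.Station   = c("15th & Pearl", "RTD", "13th & Spruce"),
  stringsAsFactors = FALSE
)

# Replace the mislabelled name in both station columns.
for (col in c("Checkout.Station", "Return.Station")) {
  trips[[col]][trips[[col]] == "RTD"] <- "14th & Canyon"
}

print(trips)
```

Looping over both columns keeps the checkout and return sides consistent, which matters later when the two are compared station by station.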
This section involves a lot of visualizations. It’s a combination of univariate and multivariate plots, with the focus on one variable at a time.
Fig1: Rider/Operator Count
Fig2: Rider/Operator Count separated by Pass Type
NOTE: Most riders had fewer than 200 rides, so to make the patterns easier to see, the data was subset to riders with 200 or more rides.
The following trends can be noted from the plots in this section from the subset data:
Some riders really like to use B-cycle for their rides (Fig1). Faceting by pass type (Fig2) gives a better understanding of which passes they prefer. The Annual pass is the clear favorite among people who use the bikes often (not surprising), but one rider took a little more than 200 rides on the 24-hour pass (surprising, given the other pass types available).
The number of rides on the Maintenance pass is very interesting: a few users account for a lot of rides. This indicates that these were operators who regularly used and fixed bikes.
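The subsetting to frequent riders could look roughly like this. It is a sketch on made-up rider IDs (the IDs and the `trips` data frame are hypothetical); only the 200-ride threshold and the `Rider.or.Operator.Number` column come from the text.

```r
# Hypothetical trips: one rider with 250 rides, one with 3, one operator with 200.
trips <- data.frame(
  Rider.or.Operator.Number = c(rep("R0000001", 250),
                               rep("R0000002", 3),
                               rep("M0000003", 200)),
  stringsAsFactors = FALSE
)

# Count rides per rider, then keep only riders with 200 or more rides.
rides_per_rider <- table(trips$Rider.or.Operator.Number)
frequent_ids    <- names(rides_per_rider[rides_per_rider >= 200])
frequent_trips  <- trips[trips$Rider.or.Operator.Number %in% frequent_ids, , drop = FALSE]

print(rides_per_rider[frequent_ids])
```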
Fig3: Pass Type Count
Fig4: Pass Type Count separated by Day of the Week
There are 5 pass types, as shown in the plot above (Fig3). The Annual pass is clearly the most popular, followed by the 24-hour pass. The Maintenance pass also sees relatively high use compared to the 7-day and Semester (150-day) passes, which pale in comparison.
One thing that stands out from Fig4 is that the 24-hour pass is used far more than the Annual pass on weekends, whereas on weekdays the Annual pass is still the most widely used. Semester and 7-day pass usage is comparatively very low.
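The weekday versus weekend comparison behind Fig4 can be approximated with a cross-tabulation. The sketch below uses a handful of invented rows; the `Day.Type` column is a helper introduced here for illustration and is not part of the original dataset.

```r
# Hypothetical trips; column names follow the dataset summary.
trips <- data.frame(
  Entry.Pass.Type      = c("Annual", "Annual", "24-hour", "24-hour", "24-hour", "Annual"),
  Checkout.Day.of.Week = c("Monday", "Tuesday", "Saturday", "Sunday", "Friday", "Saturday"),
  stringsAsFactors = FALSE
)

# Helper column (assumption): collapse the seven days into Weekday/Weekend.
trips$Day.Type <- ifelse(trips$Checkout.Day.of.Week %in% c("Saturday", "Sunday"),
                         "Weekend", "Weekday")

# Counts per pass type and day type, and each pass type's share within a day type.
counts <- table(trips$Entry.Pass.Type, trips$Day.Type)
shares <- prop.table(counts, margin = 2)

print(counts)
print(round(shares, 2))
```

Column-wise shares make the comparison independent of how many weekday and weekend trips there are overall.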
Fig5: Bike# Count separated by Pass Type
Fig6: Bike# Count separated by Day of the Week
Fig7: Bike# Count separated by Pass Type & Day of the Week
No trends were expected when analyzing bike numbers, but surprisingly there are some.
Except for the 7-day and Semester pass types, the bike numbers in the middle of the range seem to be the most used. This might be related to the stations they are at, as some stations are more popular than others, as we will see below.
Fig8: Day of the Week Count
Fig9: Day of the Week Count separated by Pass Type
Friday is overall the most popular day of the week for ridership, followed by Thursday (surprising) and then Saturday. Monday, Tuesday & Wednesday usage is very close, whereas Sunday usage is markedly lower than on other days (Fig8).
When looking at the data faceted by pass type (Fig9), Annual pass holders like to use their passes on weekdays (the distribution across the week is almost bell-shaped). It is the complete opposite for 24-hour pass users, who like riding on weekends (as noted in the Pass Type section).
Maintenance rides are common on weekdays.
Semester pass holders like to use their pass on the weekdays with Tuesday being the most popular.
In the case of 7-day pass, there is no visible trend but Thursday is most popular followed by Friday & Saturday.
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
##     -2.00      5.00     12.00     63.36     26.00 181600.00
Fig10: Box plots of Trip Duration
Fig11: Trip Duration Distribution
Fig12: Trip Duration Distribution separated by Pass Type
Fig13: Trip Duration Distribution separated by Day of the Week
Fig14: Trip Duration Distribution separated by Pass Type and Day of the Week
This section also uses a subset of the data. Trip duration had a lot of outliers (the maximum is 181607 minutes against a median of 12), so the data was subset to rides of 60 minutes or less.
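A minimal sketch of that filter is shown below, using a few durations taken from the outputs above plus two invented values (61 and 30 appear only for illustration). The upper bound of 60 minutes comes from the text; dropping negative durations is an added assumption, since the original does not say how the -2 minimum was handled.

```r
# Sample durations in minutes: real extremes from the summary (-2, 181607) plus toy values.
durations <- c(-2, 3, 5, 12, 16, 30, 61, 181607)

# Keep rides up to and including 60 minutes.
# Assumption: negative durations are treated as data errors and dropped as well.
kept <- durations[durations >= 0 & durations <= 60]

# The mean is pulled far above the median by a single extreme ride.
cat("all rides:  mean =", mean(durations), " median =", median(durations), "\n")
cat("<= 60 min:  mean =", mean(kept), " median =", median(kept), "\n")
```

The gap between mean and median before filtering mirrors the full dataset, where the mean (63.36) is far above the median (12).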
Overall (Fig11), the trip duration distribution is right-skewed rather than symmetric: it peaks at 4-6 minutes, falls off sharply, and levels out from about the 32-minute mark.
Faceting by pass type (Fig12), the most common trip duration is 12-13 minutes for the 24-hour pass, 10-13 minutes for the 7-day pass, 4-6 minutes for Annual, 1 minute for Maintenance (quick maintenance rides!) and 6-8 minutes for the Semester pass.
Looking at the plots by day of the week (Fig13), 4-6 minutes is still the most common duration on weekdays but not on weekends, when 10-12 minutes seems to be more common. This may be because people are not in a hurry on weekends.
Combining pass type and day of the week (Fig14), the Annual pass holders' trip duration pattern doesn't change much; the overall 4-6 minute peak still holds. For the 24-hour pass, trip durations tend to be in the upper range, 10+ minutes. The 7-day pass doesn't show a clear pattern in the plots. For Maintenance, 1 minute is the most common turnaround time. For the Semester pass, 6-8 minutes is still the most common trip duration.
Fig15: Checkout Station Count
Fig16: Checkout Station Count separated by Day of the Week
Fig17: Checkout Station Count separated by Pass Type
Fig18: Trip Duration Distribution separated by Checkout Station
15th & Pearl and 13th and Spruce are the 2 most popular checkout stations in Boulder (Fig15). There is a close tie between the 11th and Pearl and Municipal Building stations. Greenhouse and Gunbarrel North are the least used stations. 14th and Walnut office might be an error, as this location doesn’t have a latitude and longitude listed.
Faceting by day of the week (Fig16), 15th & Pearl is still the most popular checkout station. 13th and Spruce and Municipal Building are the next most popular checkout stations from Mon-Thu, and 11th & Pearl from Fri-Sun.
Analyzing the checkout stations by pass type (Fig17), 15th & Pearl is still the most popular checkout station for all pass types except the Semester pass. For the 24-hour pass, 11th and Pearl is the 2nd most popular checkout station, followed by 19th @ Broadway. The Village seems to be the 2nd most popular station for the 7-day pass. The distribution for the Annual pass doesn’t differ much from the overall pattern noted for Fig15, because this is the most popular pass type.
One thing to note is the spikes in maintenance (Fig16 & Fig17) at locations like The Village and 26th @ Pearl, which are not in line with the overall checkout station popularity pattern. This might indicate that the bikes at those stations were subject to rougher use, or that a batch of bikes had a few defects.
Faceting the trip duration by checkout station (Fig18), there are no major surprises: the overall pattern still holds across the popular stations, with ride times in the 6-10 minute range.
Fig19: Return Station Count separated by Pass Type
Fig20: Return Station Count separated by Day of the Week
Fig21: Return Station Count separated by Pass Type
Fig22: Trip Duration Distribution separated by Return Station
There are no surprises in the Return Station analysis; most if not all of the points from the previous section apply here as well.
Fig23: Checkout Date Distribution
Fig24: Checkout Date Distribution separated by Pass Type
Fig25: Checkout Date Distribution separated by Day of the Week
Fig26: Checkout Date Distribution separated by Pass Type and Day of the Week
The number of checkouts has progressively increased over the years from 2013 to 2016 (Fig23). There is definitely a seasonal pattern in usage: the summer months (May-August) see an increase in checkouts, with a dip on either side. This makes sense, as people tend to ride less in the winter months. Among the popular summer months, July and August have the most checkouts across the years.
Viewing the plots by pass type (Fig24), we can see that all pass types have seen an increase in usage since Boulder B-cycle was introduced. The 7-day pass saw a big increase in the summer of 2015, and the Semester pass has also seen a big increase since it was introduced in early 2014.
Among Annual pass holders, October of 2015 (Fig24) had more checkouts than any of the warmer months. This is surprising; perhaps October was warm that year, or there were a lot of events in the Boulder area that month.
Maintenance generally follows the same trend, with more instances of maintenance in the summer months and fewer in the colder months. One anomaly is that April of 2015 had the most instances of maintenance for that year, even though it wasn’t the most popular month in terms of ridership. This might indicate that Boulder B-cycle was preparing in advance for the busy summer months. This seems a reasonable guess, because maintenance was lower in the months following April 2015.
Analysing checkouts by day of the week (Fig25), only Tue-Wed deviate from the general trend that August is the most popular month followed by July. For Tue-Wed, the roles of July and August are reversed.
The multivariate analysis (Fig26) shows finer trends in popular days across months and pass types, but there are no new points to note beyond the ones already documented.
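The monthly trends discussed above depend on turning the `Checkout.Date` strings (for example `6/25/2015`) into dates and counting trips per month. One way this might be done is sketched here; the sample dates are taken from the outputs above, while the variable names are illustrative.

```r
# A few Checkout.Date values in the dataset's month/day/year format.
checkout_dates <- c("5/20/2011", "6/25/2015", "6/26/2015", "8/2/2015", "8/8/2015")

# Parse to Date, then build a year-month key that sorts chronologically.
parsed    <- as.Date(checkout_dates, format = "%m/%d/%Y")
month_key <- format(parsed, "%Y-%m")

# Number of checkouts per month.
monthly_counts <- table(month_key)
print(monthly_counts)
```

Using a `YYYY-MM` key rather than the month name keeps the same month in different years separate, which is needed to see growth from year to year.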
Fig27: Checkout Time Distribution
Fig28: Return Time Distribution
Fig29: Checkout Time Distribution separated by Pass Type
Fig30: Return Time Distribution separated by Pass Type
Checkouts/returns start slowly a little before 7:00 (Fig27), followed by a big increase just after 7:00. From then on, checkouts/returns decrease slightly, then increase again from 10:00 to 11:15, dip at around 13:00, and increase until 15:00. There is another dip, followed by an increase in ridership after 18:00.
The return and checkout times follow each other closely because the most common trip duration in Boulder is under 10 minutes.
For 24-hour pass holders, checkout/return times (Fig29/30) start off strong in the morning and slowly decrease, except for one spike at 10:00. They start increasing again at around 14:30, peak at 18:00, and then slowly decrease.
Among 7-day pass holders (Fig29/30), checkouts/returns start increasing at around 11:30 and peak at 14:30. This pattern also holds true for Semester pass holders.
Annual pass usage follows the overall pattern described above, whereas for Maintenance the peak is in the morning before 11:00, followed by a big dip and then a big increase after 15:00.
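The time-of-day patterns above require converting values such as `9:24:00 AM` into an hour on the 24-hour clock. The helper below is a small sketch of one way to do that without relying on locale-dependent AM/PM parsing; `to_hour` is a hypothetical name, and the sample times are taken from the outputs earlier in the document.

```r
# Convert "h:mm:ss AM/PM" strings to the hour of day (0-23).
to_hour <- function(x) {
  parts <- strsplit(x, "[: ]")          # e.g. "9", "24", "00", "AM"
  vapply(parts, function(p) {
    h <- as.integer(p[1]) %% 12L        # 12 AM -> 0, 12 PM -> 0 (before the PM shift)
    if (p[4] == "PM") h <- h + 12L      # shift afternoon hours
    h
  }, integer(1))
}

times <- c("9:24:00 AM", "12:04:00 AM", "12:16:00 PM", "5:05:00 PM")
hours <- to_hour(times)

# Checkouts per hour, keeping empty hours so the profile covers the whole day.
hourly_counts <- table(factor(hours, levels = 0:23))
print(hours)
```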
Fig31: Checkout Time Distribution separated by Day of the Week
Fig32: Return Time Distribution separated by Day of the Week
Fig33: Checkout Time Distribution separated by Pass Type & Day of the Week
Fig34: Return Time Distribution separated by Pass Type & Day of the Week
Among 24-hour pass holders (Fig31/32), the checkout/return pattern from Tue-Thu is different from Fri-Mon. Tue-Thu checkouts/returns don’t dip as much in the middle of the day as they do from Fri-Mon.
7-day pass checkouts/returns (Fig31/32) peak at around 15:00 from Mon-Thu, with Fri-Sun seeing peaks and drops throughout the day. The same pattern applies to Semester pass holders.
Annual pass holders like to use the service around 11:00, 15:00 and 18:00 on weekdays, whereas on Saturdays and Sundays the peak usage is in the mornings and evenings.
Maintenance peaks at 15:00 on weekdays and in the mornings/evenings on Saturdays and Sundays (with a big drop in maintenance in the middle of the day).
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Boulder,+Colorado&zoom=13&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Boulder,%20Colorado&sensor=false
Fig35: Heat Map of Checkout Stations
Fig36: Heat Map of Return Stations
The size of each circle represents the overall number of checkouts/returns per station since B-cycle started. From the two maps (Fig35 & Fig36) it is clear that the downtown stations are the most frequently used. The stations near downtown and in/near the University come next in terms of usage.
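The data behind such a map is a per-station count joined to the station coordinates. The sketch below shows that join on toy data: the `stations` table, its column names and the coordinate values are placeholders invented for illustration, not the real location dataset.

```r
# Hypothetical trips; station names taken from the text.
trips <- data.frame(
  Checkout.Station = c("15th & Pearl", "15th & Pearl", "13th & Spruce",
                       "14th and Walnut office"),
  stringsAsFactors = FALSE
)

# Placeholder location table (column names and coordinates are NOT the real values).
stations <- data.frame(
  Station   = c("15th & Pearl", "13th & Spruce"),
  Latitude  = c(40.01, 40.02),
  Longitude = c(-105.27, -105.28),
  stringsAsFactors = FALSE
)

# Count checkouts per station, then attach coordinates.
counts <- as.data.frame(table(Station = trips$Checkout.Station),
                        responseName = "Checkouts", stringsAsFactors = FALSE)
map_data <- merge(counts, stations, by = "Station", all.x = TRUE)

print(map_data)
```

Keeping unmatched stations (`all.x = TRUE`) makes entries without coordinates, such as the 14th and Walnut office case mentioned earlier, show up as `NA` rather than silently disappear.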
# Load caret for createDataPartition(), train(), resamples() and confusionMatrix()
library(caret)
# Subset the data, keeping Entry.Pass.Type (column 3) and Trip.Duration..Minutes. (column 13)
mlsubset <- dataset[c(3, 13)]
# Create a stratified 80/20 train/test partition
trainIndex <- createDataPartition(mlsubset$Entry.Pass.Type, p = 0.8, list = FALSE, times = 1)
trainingset <- mlsubset[trainIndex, ]
testset <- mlsubset[-trainIndex, ]
# Split the training set for 10-fold cross validation: train on 9, test on 1 for all combinations
control <- trainControl(method = "cv", number = 10)
metric <- "Accuracy"
# Evaluate 3 different algorithms, resetting the same seed before each one
# Linear Discriminant Analysis
set.seed(7)
fit.lda <- train(Entry.Pass.Type ~ ., data = trainingset, method = "lda",
                 metric = metric, trControl = control)
## Loading required package: MASS
# Classification and Regression Tree
set.seed(7)
fit.cart <- train(Entry.Pass.Type ~ ., data = trainingset, method = "rpart",
                  metric = metric, trControl = control)
## Loading required package: rpart
# Naive Bayes
set.seed(7)
fit.nb <- train(Entry.Pass.Type ~ ., data = trainingset, method = "nb",
                metric = metric, trControl = control)
## Loading required package: klaR
# Summarize accuracy of models
results <- resamples(list(lda = fit.lda, cart = fit.cart, nb = fit.nb))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: lda, cart, nb
## Number of resamples: 10
##
## Accuracy
##        Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## lda  0.4551  0.4552 0.4553 0.4554  0.4555 0.4557    0
## cart 0.6099  0.6123 0.6129 0.6134  0.6155 0.6167    0
## nb   0.5481  0.5549 0.5580 0.5568  0.5589 0.5613    0
##
## Kappa
##           Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## lda  0.0007871 0.001297 0.001471 0.001569 0.001746 0.002487    0
## cart 0.3663000 0.370100 0.372800 0.372800 0.376200 0.377200    0
## nb   0.2117000 0.219100 0.224500 0.223000 0.226200 0.232500    0
# Dot plot of the results
dotplot(results)
# Compare against the test set
predictions <- predict(fit.cart, testset)
confusionMatrix(predictions, testset$Entry.Pass.Type)
## Confusion Matrix and Statistics
##
##                     Reference
## Prediction           24-hour 7-day Annual Maintenance Semester (150-day)
##   24-hour              10553   428   4175        2578                371
##   7-day                    0     0      0           0                  0
##   Annual                5518   653  17300        2356               1333
##   Maintenance            657    36   1133        2533                 83
##   Semester (150-day)       0     0      0           0                  0
##
## Overall Statistics
##
##                Accuracy : 0.6113
##                  95% CI : (0.607, 0.6156)
##     No Information Rate : 0.4548
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.3685
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: 24-hour Class: 7-day Class: Annual
## Sensitivity                  0.6309      0.00000        0.7652
## Specificity                  0.7710      1.00000        0.6361
## Pos Pred Value               0.5829          NaN        0.6370
## Neg Pred Value               0.8046      0.97753        0.7646
## Prevalence                   0.3365      0.02247        0.4548
## Detection Rate               0.2123      0.00000        0.3480
## Detection Prevalence         0.3642      0.00000        0.5464
## Balanced Accuracy            0.7009      0.50000        0.7007
##                      Class: Maintenance Class: Semester (150-day)
## Sensitivity                     0.33923                   0.00000
## Specificity                     0.95481                   1.00000
## Pos Pred Value                  0.57024                       NaN
## Neg Pred Value                  0.89100                   0.96405
## Prevalence                      0.15022                   0.03595
## Detection Rate                  0.05096                   0.00000
## Detection Prevalence            0.08936                   0.00000
## Balanced Accuracy               0.64702                   0.50000
This section explored whether machine learning algorithms could predict the pass type with high accuracy from the only numeric variable, trip duration. Three algorithms were tested (LDA, CART and Naive Bayes). Among the 3, CART (better known as a decision tree) had the best results, but its accuracy was still low (about 61%, below 65%). LDA, at about 46%, did no better than the No Information Rate of 0.4548, and the CART model never predicted the 7-day or Semester classes. With trip duration as the only numeric predictor, the algorithms didn’t perform that well.
If Boulder B-cycle had provided another numeric variable, such as the distance covered during each trip, it would probably have helped with the classification and its accuracy.
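To make the headline numbers concrete, the overall statistics can be recomputed by hand from the confusion matrix reported above. The sketch below copies the counts from that output; the variable names are illustrative, and the formulas are the standard definitions of accuracy, No Information Rate and Cohen's Kappa rather than caret's internal code.

```r
classes <- c("24-hour", "7-day", "Annual", "Maintenance", "Semester (150-day)")

# Rows are predictions, columns are the reference (true) classes.
cm <- matrix(c(
  10553, 428,  4175, 2578,  371,
      0,   0,     0,    0,    0,
   5518, 653, 17300, 2356, 1333,
    657,  36,  1133, 2533,   83,
      0,   0,     0,    0,    0),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes))

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n                     # share of correct predictions
nir      <- max(colSums(cm)) / n                  # always guessing the largest class
expected <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
cohen_kappa <- (accuracy - expected) / (1 - expected)
sensitivity <- diag(cm) / colSums(cm)             # per-class recall

print(round(c(accuracy = accuracy, nir = nir, kappa = cohen_kappa), 4))
print(round(sensitivity, 4))
```

The gap between accuracy and the No Information Rate, rather than accuracy alone, is what shows how much trip duration actually contributes; Kappa summarizes the same idea on a chance-corrected scale.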
Many observations were made in this document; listed here are the major findings from the dataset.
The Annual pass is the most used pass type on weekdays, and the 24-hour pass on weekends.
Overall, Friday is the most popular day for ridership.
4-6 minutes is the most common trip duration on weekdays and 10-12 minutes on weekends. Maintenance rides are usually quick (1-2 minutes).
Downtown stations are the most frequently used in Boulder, followed by the stations in/near CU Boulder and near downtown.
The summer months are the most popular for usage (October of 2015 was an anomaly). The winter months see considerably lower usage.
Boulder B-cycle usage has grown since 2013, with 2016 seeing a big increase in usage.
The Checkout/Return times are like a sine wave, with peaks and troughs in usage throughout the day on weekdays. On weekends, peak usage is in the mornings and evenings, with a dip in the middle of the day.
Machine learning accuracy is < 65% with Kappa < 0.40; performance could probably be improved if other numerical data, such as the distance covered during each trip, were available.